Anthesis assessment

Skills Assessment

Tiqvah Potgieter

All my work is conducted in rStudio, using R. Please note that the graphs are zoomable.The general layout of this assessment starts with each task and the questions. Below each question is my answer but all my workings for that answer is displayed in R (which follows the task section). Please also read the commented code which helps explain my thinking process.

Task 1: Data Interpretation and Analysis

Questions:

1.Which province-municipality-district combination uses the most Diesel Consumption and in which year respectively?

Answer 1: Free State-Ngwathe-Fezile Dabi has the most Diesel consumption, specifically year 2020 with a total of 76107.5 for that year

2.What is the average Diesel Consumption for KwaZulu-Natal?

Answer 2: The average diesel consumption for KwaZulu-Natal is 120.6709

3. Identify which farms in KwaZulu-Natal have used more than the average Diesel Consumption for the same province.

Answer 3:

1 Durleigh, 2 Broadview, 3 Kenlei Farms, 4 Stone Ridge, 5 Colbourne Farm, 6 Glenisla, 7 Spring Valley, 8 Baynesfield, 9 Der, 10 Glen Read, 11 Sugarbush, 12 Avogrow, 13 Sarsgrove.

4a. Who emitted the lowest carbon dioxide emissions rate overall and in what year?

Answer 4a: FarmID=570 Farm=Rietfontein Year=2022 co2_emission_rate=0.09921795

4b. When comparing the avg 2017-2019 carbon dioxide emissions rate with that of the avg 2020-2021 rate per farm, which farm increased their avg rate by the largest delta?

Answer 4b: FarmID=552 with a delta of 21909.577

############################### TASK1 #########################################################

##############Question: 1 

#Loading libraries that I will need for this assessment 
library(dplyr)
library(readr)
library(tidyr)

#Reading in all csv files 
farm_details <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/FarmDetails.csv")
farms <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/Farms.csv")

#Merging farm_details and farms by their FarmID,including all rows from both datasets and filling NAs for missing values (if any)
farms_merged <- full_join(farm_details,farms, by = "FarmID")

#Checked to see if there are any NAs,results showed that there are NAs. This is important to note especially when you are going to do calculations 
any(is.na(farms_merged))
[1] TRUE
#Creating a new colum called DieselConsumption which will be calculated as QuantityDiesel/FarmSize ,assuming QuantityDiesel is in tonnes and FarmSize in Ha
farms_merged <- farms_merged %>% 
  mutate(DieselConsumption = QuantityDiesel/FarmSize) 

#Showing the first 10 rows of farms_merged
head(farms_merged,10)
   FarmID Year QuantityDiesel QuantityPetrol     FarmName      Province
1     108 2022        75900.0              0     Durleigh KwaZulu-Natal
2     528 2022        91194.0              0    Broadview KwaZulu-Natal
3     688 2016        29369.0              0 Kenlei Farms KwaZulu-Natal
4     688 2017        31010.0              0 Kenlei Farms KwaZulu-Natal
5     688 2018        31865.0              0 Kenlei Farms KwaZulu-Natal
6     688 2019        32022.0              0 Kenlei Farms KwaZulu-Natal
7     688 2020        36688.8              0 Kenlei Farms KwaZulu-Natal
8     688 2021        39273.0              0 Kenlei Farms KwaZulu-Natal
9     701 2018         5135.5              0        April    Free State
10    701 2019         5088.5              0        April    Free State
           Municipality      District FarmSize DieselConsumption
1               Mpofana Umgungundlovu 463.4549         163.76999
2  Inkosi Langalibalele      Uthukela 523.2589         174.28085
3             uMshwathi Umgungundlovu 215.1328         136.51567
4             uMshwathi Umgungundlovu 215.1328         144.14351
5             uMshwathi Umgungundlovu 215.1328         148.11780
6             uMshwathi Umgungundlovu 215.1328         148.84758
7             uMshwathi Umgungundlovu 215.1328         170.54023
8             uMshwathi Umgungundlovu 215.1328         182.55234
9           Metsimaholo   Fezile Dabi  86.6518          59.26594
10          Metsimaholo   Fezile Dabi  86.6518          58.72354
#Summarizing total diesel consumption by province-municipality-district combination and year 
farms_summary <- farms_merged %>% 
  group_by(Province,Municipality,District,Year) %>% 
  summarise(SummedDieselConsumption = sum(DieselConsumption, na.rm = TRUE)) %>% 
  arrange(,desc(SummedDieselConsumption))
  
head(farms_summary,1)
# A tibble: 1 × 5
# Groups:   Province, Municipality, District [1]
  Province   Municipality District     Year SummedDieselConsumption
  <chr>      <chr>        <chr>       <int>                   <dbl>
1 Free State Ngwathe      Fezile Dabi  2020                  76107.
#Answer 1: Free State-Ngwathe-Fezile Dabi has the most Diesel consumption,specifically year 2020 with a total of 76107.5 for that year 

##############Question: 2

# Only selecting entries where the province = KwaZulu-Natal from farms_merged 
kwaZuluNatal <- farms_merged %>% 
  filter(Province == "KwaZulu-Natal") 

#Obtaining the mean Diesel consumption for KwaZulu-Natal, ignoring all NAs
Ave_KwazuluNatal <- mean(kwaZuluNatal$DieselConsumption, na.rm = TRUE) 

#Answer 2: The average diesel consumption for KwaZulu-Natal is 120.6709 
print(Ave_KwazuluNatal)
[1] 120.6709
##############Question: 3 

above_ave <-  list(kwaZuluNatal %>% 
  filter(DieselConsumption > Ave_KwazuluNatal) %>% 
  select(FarmName) %>% 
  distinct() ) 

#Answer 3: The farms in KwaZulu-Natal that has used more than the average Diesel Conumption of KwaZulu-Natal are:       

print(above_ave) 
[[1]]
         FarmName
1        Durleigh
2       Broadview
3    Kenlei Farms
4     Stone Ridge
5  Colbourne Farm
6        Glenisla
7   Spring Valley
8     Baynesfield
9             Der
10      Glen Read
11      Sugarbush
12        Avogrow
13      Sarsgrove
##############Question: 4a  

#Asuming to only calculate the carbon emissions rate from diesel only 

#Diesel emission factor 
diesel_ef <- 0.002886 

#Calculating diesel comsumption in liters (assuming Diesel quantity is given in tonnes),so to get to liters I am only going to times Diesel by 1176.4706, assuming diesel has a density: 850 kg/m3.
farms_merged2 <- farms_merged %>% 
  mutate(Diesel_liters = QuantityDiesel*1176.47) 

#Showing the first five rows of farms_merged
head(farms_merged2,5)
  FarmID Year QuantityDiesel QuantityPetrol     FarmName      Province
1    108 2022          75900              0     Durleigh KwaZulu-Natal
2    528 2022          91194              0    Broadview KwaZulu-Natal
3    688 2016          29369              0 Kenlei Farms KwaZulu-Natal
4    688 2017          31010              0 Kenlei Farms KwaZulu-Natal
5    688 2018          31865              0 Kenlei Farms KwaZulu-Natal
          Municipality      District FarmSize DieselConsumption Diesel_liters
1              Mpofana Umgungundlovu 463.4549          163.7700      89294073
2 Inkosi Langalibalele      Uthukela 523.2589          174.2808     107287005
3            uMshwathi Umgungundlovu 215.1328          136.5157      34551747
4            uMshwathi Umgungundlovu 215.1328          144.1435      36482335
5            uMshwathi Umgungundlovu 215.1328          148.1178      37488217
#Calculating the co2 emissions from diesel (EFF)
farms_merged2 <- farms_merged2 %>% 
  mutate(EFF = Diesel_liters*diesel_ef)

#Calculating the co2 emission rate which is just EFF/farmsize
farms_merged2 <- farms_merged2 %>% 
  mutate(co2_emission_rate = EFF/FarmSize)

#Getting the farm with the lowest co2 emissions rate 
farm_lowest <- farms_merged2 %>% 
  filter(co2_emission_rate == min(co2_emission_rate,na.rm = TRUE)) %>% 
  select(FarmID, FarmName, Year, co2_emission_rate)

#Answer 4a: FarmID=570  Farm=Rietfontein    Year=2022   co2_emission_rate=0.09921795
print(farm_lowest) 
  FarmID    FarmName Year co2_emission_rate
1    570 Rietfontein 2022        0.09921795
##############Question: 4b  

#Here I am filtering for years between 2017-2021 and then grouping by FarmID, then creating 2 new groups,all entries from years 2017-2019 as one group and then grouping all entries from 2020-2021 as the second group,I am then calculating the average co2_emission_rate per FarmID per period group.
  periods <- farms_merged2 %>%
  filter(Year %in% 2017:2021) %>%
  group_by(FarmID, period = ifelse(Year <= 2019, "2017-2019", "2020-2021")) %>%
  summarize(avg_CO2_rate = mean(co2_emission_rate, na.rm = TRUE)) %>% 
  pivot_wider(names_from = period, values_from = avg_CO2_rate) 

#Showing the first 10 rows from periods 
head(periods,10) 
# A tibble: 10 × 3
# Groups:   FarmID [10]
   FarmID `2017-2019` `2020-2021`
    <int>       <dbl>       <dbl>
 1     64        9.77        9.77
 2     65       75.1        74.1 
 3     67       52.3       624.  
 4     68      146.        152.  
 5     69     1131.       1521.  
 6     71      232.        232.  
 7     72      210.        210.  
 8     74       30.2        17.5 
 9     75      811.        829.  
10     78      199.        191.  
#Calculating the difference between the averages of co2_emission_rates between the two periods
periods <- periods %>%
  mutate(delta = `2020-2021` - `2017-2019`) 

#Showing the first 10 rows from periods 
head(periods,10) 
# A tibble: 10 × 4
# Groups:   FarmID [10]
   FarmID `2017-2019` `2020-2021`  delta
    <int>       <dbl>       <dbl>  <dbl>
 1     64        9.77        9.77   0   
 2     65       75.1        74.1   -1.06
 3     67       52.3       624.   571.  
 4     68      146.        152.     6.38
 5     69     1131.       1521.   391.  
 6     71      232.        232.     0   
 7     72      210.        210.     0   
 8     74       30.2        17.5  -12.7 
 9     75      811.        829.    18.0 
10     78      199.        191.    -7.98
#Arranging periods from largest to smallest 
max_delta_farm <- periods %>%
     arrange(,desc(delta))

#Showing the first 10 rows of max_delta_farm
#Answer 4b: FarmID=552 with a delta of 21909.577
head(max_delta_farm,10)
# A tibble: 10 × 4
# Groups:   FarmID [10]
   FarmID `2017-2019` `2020-2021`  delta
    <int>       <dbl>       <dbl>  <dbl>
 1    552      30495.      52405. 21910.
 2    549      22979.      39488. 16509.
 3    771      18501.      31793. 13292.
 4    566      11573.      19888.  8315.
 5    548       7348.      12628.  5280.
 6    543      14287.      19440.  5154.
 7    559       4938.       8485.  3547.
 8    554       4187.       7195.  3008.
 9    551       6850.       9320.  2471.
10    556       3377.       5803.  2426.

Task 2: Business Intelligence and Statistical Analysis

Questions:

1.Create a meaningful visual representation/s from the two datasets provided of crop Insights and trends. You may focus on any aspects you wish. Show us your skills!

Answer 1: See code

2.Using the Crop details Yield dataset calculate the mean, median and standard deviation of each of crop type.

Answer 2:

CropType Mean Median Standard Deviation
Barley 2.841563 3.00 1.1350874
Cowpea 0.000000 0.00 0.0000000
Dry bean 2.353443 2.50 0.2412837
Groundnut 1.495000 1.70 0.7950681
Kikuyu 3.070741 0.00 5.8948164
Lucerne (Alfalfa) 16.909498 0.00 35.6801068
Maize 8.963470 7.80 7.6242254
Millet 10.082353 10.00 2.6740034
Oats 1.604579 0.00 8.3857792
Pea 0.800000 1.00 0.4472136
Potato 33.204225 35.00 2.4072548
Rye-grass 10.313676 12.00 8.0654036
Sorghum 16.556647 10.00 14.6983838
Soybeans 3.029310 2.50 1.5259881
Sugarcane 80.970356 85.00 12.0043945
Sunflower 1.839595 1.92 0.6419304
Vetch (Fodder) 16.117647 18.00 6.1020729
Wheat 2.934564 2.74 1.6788732

3.Using the Crop details dataset perform a hypothesis test to determine if there is a statistical difference ( p < 0.05) between the PPN-Fixing yields/ha in 2018 and 2021

Answer 3: There is a significant difference between the means of the two groups

############################### TASK2 #########################################################

##############Question: 1

#Loading relevant libraries 
library(ggplot2)
library(pheatmap)
library(stringr)
library(scales)

#Reading in all csv files 
crop_details <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/CropDetails.csv") 
fields <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/Fields.csv") 

#Merging crop_details and fields by their FieldID,including all rows from both datasets and filling NAs for missing values (if any)
fields_merged <- full_join(crop_details,fields ,by = "FieldID") 

#Calculating the yield per ha 
fields_merged <- fields_merged %>% 
  mutate(yields_per_ha = Yield/FieldSize)

#Note that many fieldSizes have a value of 0, and when dividing by 0 you would get undefined,so now I am going to replace these values with NA (it would get ignored during calculations) 
fields_merged$yields_per_ha[fields_merged$yields_per_ha == "Inf"] <- NA 

#viewing the total yield for every crop type per year 
#Note that when scaling the y-axis to log10, on the y-axis the values are log10 transformed, but the values are still the same, so the values are not changed, only the scale is changed

#Calculatind the total yield for every croptype per year and removing any NAs
data1 <- fields_merged %>% 
  group_by(CropType,Year) %>% 
  summarise(TotalYield_perYear = sum(Yield, na.rm = TRUE)) %>% 
  mutate_at(c("CropType","Year"),as.factor) %>% 
  mutate(Year = as.character(Year)) %>% 
  na.omit() 

ggplot(data1, aes(x=Year, y= TotalYield_perYear, group=CropType)) +
  geom_line(aes(color=CropType)) +  
  geom_point(aes(color=CropType)) +
  facet_wrap(~CropType,scales = "free") +
  scale_y_continuous(limits = c(0.0,NA)) +
theme(plot.title = element_text(size = 35 ,hjust = 0.5), plot.subtitle = element_text(size=30,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 25),legend.text = element_text(size = 20),axis.text.x = element_text(size = 20,angle = 45),axis.text.y = element_text(size = 20),strip.text = element_text(size=30),legend.key.spacing.y = unit(20, "pt")) +
  labs(title = "Yield trends over the years:",subtitle = "Total yield for every crop type per year", x = "Year", y = "Yield", color = "Crop Type") 

#Viewing every croptype on the same graph and log transforming the Yield to make it more comparable
ggplot(data1, aes(x=Year, y= (TotalYield_perYear), group=CropType)) + 
  geom_line(aes(color=CropType),linewidth =1) + 
  geom_point(aes(color=CropType),size = 3) +
 scale_y_log10() +
  theme(plot.title = element_text(size = 35 ,hjust = 0.5), plot.subtitle = element_text(size=30,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 25),legend.text = element_text(size = 20),axis.text.x = element_text(size = 20,angle = 45),axis.text.y = element_text(size = 20),legend.key.spacing.y = unit(20, "pt")) +
  labs(title = "Yield trends over the years",subtitle ="Total yield for every crop type per year",x = "Year", y = "Log 10 Transformed Yield", color = "Crop Type") 

#viewing the average yield for every crop type 
###Calculating the average yield for every crop type and rounding to 2 decimal places 
data2 <- fields_merged %>% 
  group_by(CropType) %>% 
  summarise(AveYield = round(mean(Yield,na.rm = TRUE),2)) %>% 
  mutate_at("CropType",as.factor) %>% 
  na.omit() 
  

ggplot(data2, aes(x=CropType, y= AveYield,fill = CropType)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = AveYield),size = 7) +
  theme(plot.title = element_text(size = 37 ,hjust = 0.5), plot.subtitle = element_text(size=30,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 30),legend.text = element_text(size = 20),axis.text.x = element_text(size = 20,angle = 25),axis.text.y = element_text(size = 25),legend.key.spacing.y = unit(20, "pt"))  +
  labs(title = "Yield trends:",subtitle = "Average yield for every crop type", x = "CropType", y = "Average Yield", color = "Crop Type") 

####viewing the average yield_perHa for every crop type per year
#Calculating the average yield per ha for every crop type per year and rounding to 2 decimal places 
data3 <- fields_merged %>% 
  group_by(CropType,Year) %>% 
  summarise(AveYield_perHa = round(mean(yields_per_ha,na.rm = TRUE),2)) %>% 
  mutate_at("CropType",as.factor) %>% 
  mutate(Year = as.character(Year)) %>% 
 na.omit() 
 
ggplot(data3, aes(x=Year, y= AveYield_perHa, group=CropType)) +
  geom_line(aes(color=CropType)) +  
  geom_point(aes(color=CropType)) +
  facet_wrap(~CropType,scales = "free") +
  scale_y_continuous(limits = c(0.0,NA)) +
theme(plot.title = element_text(size = 35 ,hjust = 0.5), plot.subtitle = element_text(size=30,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 25),legend.text = element_text(size = 20),axis.text.x = element_text(size = 20,angle = 45),axis.text.y = element_text(size = 20),strip.text = element_text(size=30),legend.key.spacing.y = unit(20, "pt")) +
  labs(title = "Yield trends over the years:",subtitle = "Average yield per hectare for every crop type per year", x = "Year", y = "Yield", color = "Crop Type") 

#Viewing every croptype on the same graph and log transforming the average Yield/ha to make it more comparable
ggplot(data3, aes(x=Year, y= AveYield_perHa, group=CropType)) + 
  geom_line(aes(color=CropType),linewidth =1) + 
  geom_point(aes(color=CropType),size = 3) +
 scale_y_log10(labels = comma) +
  theme(plot.title = element_text(size = 35 ,hjust = 0.5), plot.subtitle = element_text(size=30,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 25),legend.text = element_text(size = 20),axis.text.x = element_text(size = 20,angle = 45),axis.text.y = element_text(size = 20),strip.text = element_text(size=30),legend.key.spacing.y = unit(20, "pt")) +
  labs(title = "Yield trends over the years",subtitle ="Average yield per hectare for every crop type per year",x = "Year", y = "Log 10 Transformed average Yield per hectare", color = "Crop Type") 

data3.2 <- fields_merged %>% 
  group_by(CropType) %>% 
  na.omit() 

#Note the triangle represents the mean of the yield per hectare for every crop type 

ggplot(data3.2, aes(x=CropType, y= yields_per_ha, group_by=CropType)) + 
 geom_boxplot(aes(color = CropType)) +
  theme(plot.title = element_text(size = 40 ,hjust = 0.5), plot.subtitle = element_text(size=35,hjust = 0.5), axis.title.x= element_text(size = 30),axis.title.y =element_text(size = 30),legend.title = element_text(size = 30),legend.text = element_text(size = 25),axis.text.x = element_text(size = 20,angle = 15),axis.text.y = element_text(size = 20),legend.key.spacing.y = unit(20, "pt")) +
 scale_y_log10(labels = comma) +
stat_summary(fun = "mean", geom = "point",shape = 2, size = 3) +
 labs(title = "Yield per hectare trends:",subtitle ="Summary of yield per hectare for every crop",x = "CropType", y = "Yield per hectare", color = "Crop Type") 

##############Question: 2 

crop_summarized <- crop_details %>% 
  group_by(CropType) %>% 
  summarise(mean = mean(Yield), median = median(Yield), standard_deviation = sd(Yield)) 

#Answer 2: 
crop_summarized
# A tibble: 18 × 4
   CropType           mean median standard_deviation
   <chr>             <dbl>  <dbl>              <dbl>
 1 Barley             2.84   3                 1.14 
 2 Cowpea             0      0                 0    
 3 Dry bean           2.35   2.5               0.241
 4 Groundnut          1.50   1.7               0.795
 5 Kikuyu             3.07   0                 5.89 
 6 Lucerne (Alfalfa) 16.9    0                35.7  
 7 Maize              8.96   7.8               7.62 
 8 Millet            10.1   10                 2.67 
 9 Oats               1.60   0                 8.39 
10 Pea                0.8    1                 0.447
11 Potato            33.2   35                 2.41 
12 Rye-grass         10.3   12                 8.07 
13 Sorghum           16.6   10                14.7  
14 Soybeans           3.03   2.5               1.53 
15 Sugarcane         81.0   85                12.0  
16 Sunflower          1.84   1.92              0.642
17 Vetch (Fodder)    16.1   18                 6.10 
18 Wheat              2.93   2.74              1.68 
##############Question: 3 
#3.Using the Crop details dataset perform a hypothesis test to determine if there is a statistical difference ( p < 0.05) between the PPN-Fixing yields/ha in 2018 and 2021

#By only using the crop details dataset I am assuming that yields are yields/ha, as there is no details about farmsize or fieldsize in the crop details dataset. I am also assuming that all yields are PPN-Fixing yields because there is no information regarding which FieldID or CropType is a PPn-Fixing yield

#Getting all yields from year 2018
yield_2018 <- crop_details %>% 
  filter(Year == 2018) %>% 
  pull(Yield)  

#Getting all yields from year 2021
yield_2021 <- crop_details %>% 
  filter(Year == 2021) %>% 
  pull(Yield) 

# If the p-value is less than 0.05, you would reject the null hypothesis and conclude that there is a significant difference between the means of the two groups.
test <- t.test(yield_2018,yield_2021)

#Answer 3: There is a significant difference between the means of the two groups
print(test) 

    Welch Two Sample t-test

data:  yield_2018 and yield_2021
t = 3.4399, df = 4919.2, p-value = 0.0005867
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.140378 4.162547
sample estimates:
mean of x mean of y 
 18.88575  16.23429 

Task 3: Basic Data Science Skills (Code, Problem Solving, …)

Questions

1.Calculate the Euclidean distance between each point in Dataset_1 and the nearest point in Dataset_2.

Answer 1: See code

2.If the distance between a point in Dataset_1 and its nearest point in Dataset_2 is less than a user-defined threshold (which you can change), move the point from Dataset_1 to Dataset_2.

Answer 2: See code

3.Repeat this process until no more points in Dataset_1 can be moved within the threshold distance.

Answer 3: See code

############################### TASK3 #########################################################

#Loading relevant libraries 
library(tidyverse)
library(magrittr)

cart1 <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/Cartesian_Dataset_1.csv") 
cart2 <- read.csv(file = "/Users/tiqvah/Desktop/Skill Assessment/Cartesian_Dataset_2.csv") 

#Writing a function to calculate the euclidean distance between a point in dataset1(cart1) and dataset2(cart2)
Euclidean <- function(a,b){ 
  return(sqrt((a$X-b$X)^2 + (a$Y -b$Y)^2))
  }

#Creating 2 lists to store the distances 
list1 <- list()
list2 <- list()  

#Calculating the distance between each point in cart1 and every other point in cart2
for (a in 1:nrow(cart1)) { 
  
  for (b in 1:nrow(cart2)) { 
    #list1 contains all the euclidean distances for every point in cart1 to a specific point in cart2
    list1[[b]] <- Euclidean(cart1[b,],cart2[a,])
    
  } 
  
  #taking all the values from list1 and creating a column (the colum contains 50 rows which represents all the points in cart1 in relation to a second point in cart2)
  list2[[a]] <- as.data.frame(do.call(rbind,list1)) 
  }  

#creating a dataframe 50x50,
cart_distances <- as.data.frame(do.call(cbind,list2)) 

#Each row represents a point in cart1, and each column represents a point in cart2 
colnames(cart_distances) <- cart2$Object_ID 

#showing the first 20 rows of cart_distances, row1(objectID 1 from cart1) and colum1 (objectID 51 from cart2) have a euclidean distance of 1249.5
head(cart_distances,5)
        51        52       53       54       55        56       57       58
1 1249.520  393.5594 1381.408 2981.131 1017.159 2164.1490 2303.175 1725.877
2 2980.779 1697.5070 2064.338 4034.652 1575.950  879.8295 1356.538 3444.660
3 1879.966 1484.9350 2527.639 3739.780 2225.154 2843.0002 3502.119 2199.392
4 1443.629  684.2003 1778.186 3263.120 1420.679 2276.6405 2689.577 1880.928
5 2607.188 2819.9433 1576.038 2121.891 1972.022 3519.2241 1349.615 2759.299
        59        60        61       62       63        64        65       66
1 1115.565  931.6786  684.8080 2184.700 3689.337 2115.6061 2436.2276 3009.862
2 2369.897 2820.6815 2687.8290 3299.946 2383.497  817.1854 2337.9991 2040.532
3 2141.588 1523.6542 1252.1761 3022.166 4890.277 2818.8483 3608.2541 4217.860
4 1467.581 1083.2045  785.4203 2485.080 4077.709 2238.0080 2840.7393 3407.807
5 2020.283 2799.9146 2954.5621 1851.025 1612.194 3449.3237  498.9128 1009.881
        67        68       69       70       71       72       73       74
1 1817.460 2670.3846 1236.601 1602.914 1557.925 2139.163 1237.847 2682.125
2  794.688  720.4422 1407.388 3456.883 3306.199 1264.521 3189.023 1575.447
3 2551.473 3732.5022 2443.851 1935.082 2050.382 2646.121 1547.936 3881.346
4 1947.153 2972.0042 1633.638 1698.357 1712.916 2177.132 1298.299 3068.812
5 3292.492 2460.1482 1837.601 2999.767 2746.983 3833.074 3060.720 1295.484
        75       76       77       78       79        80       81        82
1 1815.515 1107.571 1329.217 1406.457 2178.867 2398.1701 1281.724 1476.7972
2 2917.852 1284.136 2648.563 1214.213 1120.486 2748.5394 3363.140  640.5693
3 2730.030 1824.441 2253.650 2027.842 2754.691 3491.6107 1081.247 2539.6805
4 2138.097 1200.034 1646.238 1468.277 2246.850 2783.8881 1149.637 1770.7956
5 1766.495 3204.323 2040.100 3388.824 3729.029  810.1586 3598.722 2432.0027
         83       84       85        86       87       88        89       90
1  984.7827 2115.985 2710.094  160.3122 2137.786 1491.557 2278.1600 1712.262
2 2950.8785 2893.110 2060.638 2241.8066 3515.584 1648.668 2721.0968 2540.357
3 1407.2050 3115.375 3916.158 1137.7368 2819.776 1817.199 3360.3931 2759.576
4 1064.9113 2471.162 3114.530  386.0324 2381.866 1427.320 2660.2331 2080.041
5 2983.9253 1351.030  691.055 2972.2370 2282.059 3789.715  938.3613 1474.810
         91       92        93       94       95        96        97       98
1 2327.0144 1235.335  254.7724 2570.051 1859.098 2888.8863 1620.6310 1857.573
2  478.7745 2880.205 2175.5625  534.627 2737.252 2000.9690  965.0699 1844.656
3 3399.5336 1958.727 1344.8636 3592.892 2862.266 4097.2944 2315.6118 3052.680
4 2632.7030 1467.382  598.7520 2850.621 2213.309 3288.5171 1726.1402 2264.267
5 2342.5356 2462.692 2760.3612 2621.328 1491.347  924.5155 3318.6916 1111.180
         99       100
1  409.7633 1602.4771
2 2064.3316  994.2635
3 1556.3319 2287.2101
4  795.6915 1702.5181
5 2551.0180 3332.5825
#showing a quick statistical summary of each column of cart_distances (only the first 5 columns)
summary(cart_distances[,c(1:5)]) 
       51               52               53               54        
 Min.   : 205.8   Min.   : 190.2   Min.   : 628.1   Min.   : 220.3  
 1st Qu.:1279.6   1st Qu.:1204.5   1st Qu.:1189.2   1st Qu.:2116.0  
 Median :2154.3   Median :1788.5   Median :1708.5   Median :3165.2  
 Mean   :2109.0   Mean   :1760.5   Mean   :1694.6   Mean   :2954.9  
 3rd Qu.:2976.1   3rd Qu.:2398.7   3rd Qu.:2078.7   3rd Qu.:3750.4  
 Max.   :3803.5   Max.   :3020.7   Max.   :3226.3   Max.   :5221.5  
       55        
 Min.   : 450.7  
 1st Qu.:1234.4  
 Median :1653.0  
 Mean   :1591.1  
 3rd Qu.:1959.2  
 Max.   :2643.2  
#created a function to normalize the data
data_norm <- function(x) {
  ((x-min(x))/ (max(x)-min(x)))
}

#normalized the data and saved as cart_distances_normalized
cart_distances_normalized <- cart_distances
cart_distances_normalized <- as.data.frame(lapply(cart_distances_normalized,data_norm)) 

#showing a quick statistical summary of the normalized data (only the first 5 columns)
summary(cart_distances_normalized[,c(1:5)])
      X51              X52              X53              X54        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.2985   1st Qu.:0.3583   1st Qu.:0.2160   1st Qu.:0.3790  
 Median :0.5416   Median :0.5647   Median :0.4158   Median :0.5888  
 Mean   :0.5290   Mean   :0.5548   Mean   :0.4105   Mean   :0.5468  
 3rd Qu.:0.7700   3rd Qu.:0.7802   3rd Qu.:0.5583   3rd Qu.:0.7058  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
      X55        
 Min.   :0.0000  
 1st Qu.:0.3575  
 Median :0.5483  
 Mean   :0.5201  
 3rd Qu.:0.6880  
 Max.   :1.0000  
#Since there is no mention of normalizing the data in the task description, i am just going to use the original values

##############Question: 2
  
#Setting a threshold    
#Creating a function to get the top 5 nearest points from cart1 for a specific point in dataset2 (cart2)   
    top_5_nearest <- function(x){
            sort(x)[1:5]
   }
  
#Setting my threshold as mean of the 5 nearest points,but can change to anything if needed
top_5_values <- apply(cart_distances,2,top_5_nearest)
my_threshold <- mean(top_5_values)
my_threshold 
[1] 523.3512
#Moving the lowest points from dataset1 (cart1) to dataset2 (cart2) if the distance is below 523.3511
#Writing a function to get the index of points that are lower than the threshold 
index <- function(x){ 
  which(x < my_threshold) 
  }

#Getting the indices of points that are lower than the threshold 
points_to_move <- apply(cart_distances,2,index) 

#Point to move shows the indices of the points of cart1 (dataset1) that are the closest (below the threshold) to each each point in cart2 (dataset2),example objectIDs 6,7,10,33,49,50 from cart1 are the closesest to ObjectID 51

head(points_to_move)
$`51`
[1]  6  7 10 33 49 50

$`52`
[1]  1 16 42

$`53`
integer(0)

$`54`
[1] 28

$`55`
[1] 45

$`56`
[1]  8 12
#Removing duplicate indices
points_to_move <- sort(unique(unlist(points_to_move))) 

#Only adding the points that were below the threshold from cart1 to cart2,note the modified cart2 is called cart2.1
for (x in points_to_move) { 
  
  if(x == 1) {
  cart2.1 <- rbind(cart2,cart1[x,]) 
  
  } else {
  cart2.1 <- rbind(cart2.1,cart1[x,])
  
  }
} 

print(cart2.1) 
    Object_ID    X    Y
1          51 1981 3468
2          52 2798 2162
3          53 1465 2362
4          54  164 3882
5          55 1960 2032
6          56 3263  433
7          57 1124 1010
8          58 1754 3901
9          59 1744 2796
10         60 2325 3335
11         61 2592 3195
12         62  829 3424
13         63   31  154
14         64 3203  471
15         65  491 1883
16         66  370  824
17         67 3142  763
18         68 1806   89
19         69 1843 1813
20         70 2056 3956
21         71 1873 3781
22         72 3650  577
23         73 2375 3704
24         74  832  768
25         75 1115 3141
26         76 3200 1509
27         77 1597 3043
28         78 3355 1248
29         79 3507  482
30         80  437 2452
31         81 2954 3830
32         82 2360 1155
33         83 2459 3465
34         84  745 2897
35         85  460 1245
36         86 2823 2714
37         87 1081 3779
38         88 3778 1400
39         89  555 2527
40         90 1129 2722
41         91 1919  414
42         92 1890 3352
43         93 2611 2679
44         94 2038  110
45         95 1002 2876
46         96  433  946
47         97 3231  983
48         98 1114 1850
49         99 2424 2579
50        100 3251 1007
51          1 2833 2554
210         2 2387  515
410         4 3209 2709
52          5    6 1766
61          6 2248 3847
71          7 2171 3014
81          8 3716  293
91          9 3459 1312
101        10 2091 3193
111        11 2799 3521
121        12 3309  198
131        13 2152  477
141        14  277 1771
151        15 3784  119
161        16 2707 1995
171        17    5 2592
181        18  528 1018
191        19  687  432
211        21 2187 1439
221        22 1185 3338
231        23  515 2484
241        24 1948 1326
261        26 3156 2764
271        27 3633 1718
281        28  176 3662
291        29  694  525
301        30  494  571
311        31 1069 1431
331        33 2108 3306
341        34  650 1535
361        36 2330  816
371        37 1094  642
381        38 1339 1419
391        39  968 2746
401        40 1590   24
411        41  785 3475
421        42 2885 2670
431        43  883 3636
441        44  585  476
451        45 2196 1648
461        46  354  613
471        47 1926  170
481        48 3240 1002
491        49 2245 3756
501        50 1947 3131
#Removing the indices that were moved from cart1 and calling it cart1.1
cart1.1 <- cart1 
cart1.1 <- cart1.1[-c(points_to_move),]

print(cart1.1)
   Object_ID    X    Y
3          3 3844 3216
20        20 1211  382
25        25 3636 2414
32        32 1778 1161
35        35 3571 2228
#Saving original points from cart1 and cart2 in a scatter plot
cart_combined <- rbind(cart1,cart2) 

cart_combined <- cart_combined %>% 
  mutate(cart = ifelse(Object_ID < 51,1,2)) 
cart_combined$cart <- as.factor(cart_combined$cart)

ggplot(cart_combined, aes(x = X, y = Y, color = cart)) +
  geom_point(size = 5) +
  geom_text(aes(label = Object_ID), vjust = -0.5, size = 6 , color = "black") +
  labs(title = "Scatter Plot of the dataset1 and dataset2",
       x = "X Distance",
       y = "Y Distance",
       color = "Dataset") +
  theme_minimal() + 
   theme(plot.title = element_text(size = 30 ,hjust = 0.5), plot.subtitle = element_text(size=45,hjust = 0.5), axis.title.x= element_text(size = 35),axis.title.y =element_text(size = 35),legend.title = element_text(size = 45),legend.text = element_text(size = 40),legend.key.size = unit(20, "pt"),axis.text.x = element_text(size = 38),axis.text.y = element_text(size = 38)) +
  guides(color = guide_legend(override.aes = list(size = 10))) 

#Showing the points after they have been moved from cart1 to cart2
cart_combined2 <- rbind(cart1.1,cart2.1) 

cart_combined2 <- cart_combined2 %>% 
  mutate(cart = ifelse(Object_ID == c(3,20,25,32,35),1,2)) 

cart_combined2$cart <- as.factor(cart_combined2$cart)

ggplot(cart_combined2, aes(x = X, y = Y, group = cart)) +
  geom_point(aes(color = cart),size = 5) +
  geom_text(aes(label = Object_ID), vjust = -0.5, size = 6 , color = "black") +
  labs(title = "Scatter Plot of the dataset1 and dataset2 after point have been moved",
       x = "X Distance",
       y = "Y Distance",
       color = "Dataset") +
 theme_minimal()  +
   theme(plot.title = element_text(size = 30 ,hjust = 0.5), plot.subtitle = element_text(size=45,hjust = 0.5), axis.title.x= element_text(size = 35),axis.title.y =element_text(size = 35),legend.title = element_text(size = 45),legend.text = element_text(size = 40),legend.key.size = unit(20, "pt"),axis.text.x = element_text(size = 38),axis.text.y = element_text(size = 38)) +
  guides(color = guide_legend(override.aes = list(size = 10)))